Abstract. The shortest common superstring and the shortest common
supersequence are two well-studied problems with a wide range of applications.
In this paper we consider both problems under resource constraints, denoted the
Restricted Common Superstring (RCSstr) problem and the Restricted Common
Supersequence (RCSseq) problem. In the RCSstr (RCSseq) problem we are given a
set $S$ of $n$ strings, $s_1, s_2, \ldots, s_n$, and a multiset
$t = \{t_1, t_2, \ldots, t_m\}$, and the goal is to find a permutation
$\pi : \{1, \ldots, m\} \to \{1, \ldots, m\}$ that maximizes the number of
strings in $S$ that are substrings (subsequences) of
$\pi(t) = t_{\pi(1)} t_{\pi(2)} \cdots t_{\pi(m)}$ (we call this ordering of the
multiset, $\pi(t)$, a permutation of $t$). We first show that in its most
general setting the RCSstr problem is NP-complete and hard to approximate
within a factor of $n^{1-\epsilon}$, for any $\epsilon > 0$, unless P = NP.
Afterwards, we present two separate reductions to show that the RCSstr problem
remains NP-Hard even when the elements of $t$ are drawn from a binary alphabet
or when all input strings are of length two. We then present some
approximation results for several variants of the RCSstr problem. In the second
part of this paper, we turn to the RCSseq problem, where we present some
hardness results, tight lower bounds and approximation algorithms.



1 Introduction 
1.1 Motivation 

In AI planning research it is very important to exploit the interactions between 
different parts of plans. This was observed early in the area [18, 23, 26]. One very 
important type of interaction is the merging of different actions to make the total 
plan more efficient. 

In the general setting we have a set of goals (or tasks) which have to be 
accomplished and we want to find the most cost efficient plan which achieves 
all the goals. This problem is also known as the shortest common superstring in 
the case that every goal has to be done continuously or the shortest common 
supersequence if we can abandon a task and resume its process later. In both 
problems we assume that we have an unlimited set of resources and we want to 
achieve all our goals. Of course, in real life this is never the case: our resources 
are always limited. 



Therefore, a more realistic question is: given a fixed set of resources, how 
many goals can be achieved (continuously or not)? 

It seems that most applications of the shortest common superstring and the
shortest common supersequence problems are better suited to the case of
limited resources. The main challenge in such applications is to find the
arrangement that accomplishes the maximum number of goals.

As an example, Wilensky [25] gives the scenario where John is planning to
go camping for a week. He goes to the supermarket to buy a week's worth of
groceries. John has to achieve a set of goals (e.g. to buy food for the meals
of the camping week) and he is able to merge some goals (e.g. to buy different
products during a single trip to the supermarket) in order to make the plan
more efficient.

Another application, from the computational biology area, is the case where 
only the set of amino acids can be determined and not their precise ordering. 
Here we want to know which ordering would maximize the number of short 
strings which can be substrings or subsequences of some ordering of the symbols 
in a given text. 

1.2 Previous work 

In the shortest common supersequence problem we are given a set $S$ of $n$
strings, $s_1, s_2, \ldots, s_n$, and we want to find the shortest string that
is a supersequence of every string in $S$. For arbitrary $n$ the problem is
known to be NP-Hard [11], even in the case of a binary alphabet [16]. However,
for fixed $n$ a dynamic programming approach takes polynomial time and space.
The shortest common supersequence problem has been studied extensively from a
theoretical point of view [9, 12, 15, 17], from an experimental point of view
[1, 5], and from the perspective of its wide range of applications in data
compression [21], query optimization in database systems [20] and text
editing [19].

In the shortest common superstring problem we are given a set $S$ of $n$
strings, $s_1, s_2, \ldots, s_n$, and we want to find the shortest string that
is a superstring of every string in $S$. For arbitrary $n$ the problem is known
to be NP-Complete [7] and APX-hard [3]. Even for the case of a binary alphabet,
Ott [13] presented lower bounds for the achievable approximation ratio. The
best known approximation ratio so far is 2.5 [10, 22].

1.3 Our contributions 

We consider the complexity and the approximability of two problems which are 
closely related to the well-known shortest common superstring and shortest com- 
mon supersequence problems. 



Problem 1. (Restricted Common Superstring (Supersequence)) The input consists
of a set $S = \{s_1, s_2, \ldots, s_n\}$ of $n$ strings over an alphabet
$\Sigma$ and a multiset $t = \{t_1, t_2, \ldots, t_m\}$ over the same alphabet.
The goal is to find an ordering of the multiset $t$ that maximizes the number
of strings in $S$ that are a substring (subsequence) of the ordered multiset.
We denote this ordering by $\pi(t) = t_{\pi(1)} t_{\pi(2)} \cdots t_{\pi(m)}$
(and we say that $\pi(t)$ is a permutation of $t$). If all the strings in $S$
have length at most $\ell$, we refer to the problem as RCSstr[$\ell$]
(RCSseq[$\ell$]). For simplicity of presentation, we assume throughout that all
the input strings are distinct and every string $s_i \in S$ is a substring of
at least one permutation $\pi(t)$.

Example 1. Let multiset $t = \{a, a, b, b, c, c\}$ and set
$S = \{abb, bbc, cba, aca\}$ be an instance of RCSstr (and also of RCSstr[3]).
In this example the maximum number of strings from $S$ that can be a substring
of a permutation of $t$ is 3. One such permutation is $\pi(t) = acabbc$, which
contains the strings $aca$, $abb$, $bbc$ as substrings.

Example 2. Let multiset $t = \{a, a, b, c\}$ and set $S = \{ab, bc, cb, ca\}$
be an instance of RCSseq and also of RCSseq[2]. In this example the maximum
number of strings from $S$ that can be a subsequence of a permutation of $t$ is
3. One such permutation is $\pi(t) = abca$, which contains the strings $ab$,
$bc$, $ca$ as subsequences.
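The two examples are small enough to check exhaustively. The following is a minimal brute-force sketch, not an algorithm from the paper; the function names `is_subsequence` and `brute_force_rcs` are hypothetical, and the running time is exponential in the size of $t$.

```python
from itertools import permutations

def is_subsequence(s, text):
    """Return True if s can be obtained from text by deleting symbols."""
    it = iter(text)
    return all(ch in it for ch in s)

def brute_force_rcs(S, t, mode="substring"):
    """Try every distinct ordering of the multiset t.

    Returns (best count, one ordering achieving it). Illustration only:
    the number of orderings grows factorially with len(t).
    """
    best_count, best_text = -1, ""
    for p in set(permutations(t)):
        text = "".join(p)
        if mode == "substring":
            count = sum(s in text for s in S)
        else:
            count = sum(is_subsequence(s, text) for s in S)
        if count > best_count:
            best_count, best_text = count, text
    return best_count, best_text

# Example 1 (RCSstr) and Example 2 (RCSseq)
print(brute_force_rcs(["abb", "bbc", "cba", "aca"], "aabbcc"))
print(brute_force_rcs(["ab", "bc", "cb", "ca"], "aabc", mode="subsequence"))
```

Both calls should report an optimum of 3, in agreement with the examples; the ordering returned may differ from the one given in the text, since several orderings attain the optimum.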

The paper is organized as follows. In Section 2.1 we study the hardness of the
RCSstr problem. We first show that in its most general setting the RCSstr
problem is NP-complete and hard to approximate within a factor of
$n^{1-\epsilon}$, for any $\epsilon > 0$, unless P = NP. Then, we show that
even if all input strings are of length two (RCSstr[2]) and $t$ is a set, i.e.
no symbols are repeated, the RCSstr problem is APX-Hard. Afterwards, we prove
that the RCSstr problem remains NP-Hard even in the case of a binary alphabet.

In Section 2.2, we design approximation algorithms for several restricted
variants of the RCSstr problem. We first present a 3/4-approximation algorithm
for the RCSstr[2] problem where $t$ is a set. Moreover, we give a
$1/(\ell(\ell(\ell + 1)/2 - 1))$-approximation algorithm for RCSstr[$\ell$],
where $\ell$ is the length of the longest input string.

The RCSseq problem is studied in Section 3. In Section 3.1 we show that
the hardness results for RCSstr also hold for RCSseq. Moreover, we show an
approximation lower bound of $1/\ell!$, where $\ell$ is the length of the
longest input string.

In Section 3.2, we present approximation algorithms for two variants of the
RCSseq problem. The first is a $(1 + \Omega(1/\sqrt{\Delta}))/2$-approximation
algorithm for RCSseq[2], where $\Delta$ is the number of occurrences of the
most frequent character in $S$. Then, for RCSseq[$\ell$] we show that the
selection of an arbitrary permutation $\pi(t)$ yields a randomized
$1/\ell!$-approximation algorithm, thus matching the lower bound presented in
Section 3.1.

2 RCSstr 

2.1 Hardness of the RCSstr 

In this section we present hardness results for several variants of the RCSstr 
problem. 

We show here that the RCSstr problem is NP-complete and hard to approximate
within a factor better than $n^{1-\epsilon}$, for any $\epsilon > 0$, unless
P = NP. To do so, we present an approximation-preserving reduction from the
classical maximum clique problem.

Definition 1. (Maximum Clique) Given an undirected graph $G = (V, E)$, the
maximum clique problem is to find a vertex set $V' \subseteq V$ of maximum
cardinality such that for every two vertices in $V'$ there exists an edge
connecting the two.

The following seminal hardness result will be useful. 

Theorem 1. [27] The maximum clique problem does not have an
$n^{1-\epsilon}$-approximation, for any $\epsilon > 0$, unless P = NP.

We can now present our main hardness result of the RCSstr problem. 

Theorem 2. RCSstr is NP-complete and hard to approximate within a factor
of $n^{1-\epsilon}$, for any $\epsilon > 0$, unless P = NP.

Proof. We present an approximation-preserving reduction from the maximum
clique problem to the RCSstr problem. Given an undirected graph $G = (V, E)$,
where $V = \{v_1, v_2, \ldots, v_n\}$, we construct an instance $(S, t)$ of the
RCSstr problem in the following way.

Set $t$ to be $\{v_1^n, \ldots, v_n^n\}$, i.e. $n$ copies of each vertex
symbol, and for each vertex $v_i \in V$ define a string $s_i \in S$ as follows.
Set $d(v_i)$ to be the ordered sequence of the vertices not adjacent to $v_i$.
Set $s_i$ to be $v_i^n \cdot d(v_i)$, where $\cdot$ denotes concatenation.

We now prove that the optimal solution of the RCSstr instance (S, t) has size 
x if and only if the optimal solution of maximum clique problem on the graph G 
has size x. 

Let $\pi$ be a permutation on the multiset $t$ and let $A \subseteq S$ be all
the strings that are substrings of $\pi(t)$. Denote by $A'$ the set of vertices
in $G$ corresponding to the set of strings $A$. We prove that the vertices in
$A'$ form a clique. Suppose that this is not true and there exist two vertices
$v_i, v_j \in A'$ such that $(v_i, v_j) \notin E$. Note that, in any common
superstring of the strings $s_i$ and $s_j$, either $v_i$ or $v_j$ must have at
least $n + 1$ occurrences, since $v_i$ is not a neighbor of $v_j$ (so $v_i$
occurs in $d(v_j)$) and vice versa. This is a contradiction, since the multiset
$t$ has only $n$ copies of each character. Therefore the set of vertices $A'$
forms a clique.

On the other hand, let $A' = \{v_1, \ldots, v_k\} \subseteq V$ be a clique and
let $A = \{s_1, \ldots, s_k\} \subseteq S$ be the set of corresponding strings.
We can find a permutation of $t$ which contains all the strings in $A$ as
substrings by concatenating $s_1, \ldots, s_k$ and appending the remaining
characters arbitrarily at the end. No character is used more than $n$ times,
since the vertices from $A'$ form a clique and, therefore,
$v_i \notin d(v_j)$ for any $v_i, v_j \in A'$.

Thus, the RCSstr problem is NP-complete and hard to approximate within a
factor of $n^{1-\epsilon}$, for any $\epsilon > 0$, unless P = NP. □
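The construction used in the proof can be written down directly. The sketch below is an illustration under stated assumptions: vertices are the integers $0, \ldots, n-1$, strings are tuples of vertex symbols, $d(v)$ lists the non-neighbors of $v$ in increasing order and excludes $v$ itself, and the names `clique_to_rcsstr` and `concatenation_fits` are hypothetical. The second function only checks the constructive direction of the proof (concatenating the strings of a vertex set), not the existence of an arbitrary common superstring.

```python
from collections import Counter

def clique_to_rcsstr(n, edges):
    """Build the RCSstr instance (S, t) for an undirected graph on 0..n-1."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # t holds n copies of every vertex symbol
    t = Counter({v: n for v in range(n)})
    S = []
    for v in range(n):
        # d(v): non-neighbours of v in increasing order (v excluded: assumption)
        d = tuple(u for u in range(n) if u != v and u not in adj[v])
        S.append((v,) * n + d)  # s_v = v^n . d(v)
    return S, t

def concatenation_fits(strings, t):
    """True if concatenating the strings uses no symbol more often than t allows."""
    used = Counter()
    for s in strings:
        used.update(s)
    return all(used[c] <= t[c] for c in used)

# Path 0 - 1 - 2: {0, 1} is a clique, {0, 2} is not.
S, t = clique_to_rcsstr(3, [(0, 1), (1, 2)])
print(S)
print(concatenation_fits([S[0], S[1]], t), concatenation_fits([S[0], S[2]], t))
```

On the path graph, the strings of the clique $\{0, 1\}$ fit within the $n$ copies of each symbol, while the strings of the non-adjacent pair $\{0, 2\}$ would need a fourth copy of the symbol 0.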

We now show that the RCSstr[2] problem is APX-Hard even if $t$ is a set, i.e.
each character in $t$ is unique. To do so, we present an
approximation-preserving reduction from the classical Asymmetric maximum TSP
problem with edge weights of 0 and 1.

Definition 2. (Maximum Asymmetric Travelling Salesman Problem) 
Given a complete weighted directed graph G = (V, E) the Maximum Travelling 
Salesman Problem is to find a closed tour of maximum weight visiting all vertices 
exactly once. 

Theorem 3. [6] For any constant $\epsilon > 0$, it is NP-Hard to approximate
the Maximum Asymmetric Travelling Salesman with 0, 1 edge weights within
$320/321 + \epsilon$.

The hardness result for the RCSstr[2] problem is stated in the following the- 
orem. 

Theorem 4. There exists a constant $\beta > 0$ such that the RCSstr problem is
NP-Hard to approximate within a factor of $1 - \beta$, even if all the strings
in $S$ have length two and $t$ is a set.

Proof. We present a gap-preserving reduction from the maximum asymmetric
TSP to the RCSstr[2] problem where $t$ is a set.

Given a complete directed graph $G = (V, E)$, with $|V| = n$,
$|E| = n(n - 1)/2$ and edge weights of 0 and 1, we construct an instance
$(S, t)$ of the RCSstr[2] problem in the following way.

Set $t = V$ and for each arc $(a, b) \in E$ with weight 1 add a string $ab$ to
$S$. Let $OPT(G)$ be the length of the optimal tour on the graph $G$ and let
$OPT(S, t)$ be the maximum number of strings from $S$ which can be substrings
of a permutation of $t$. In order to have an inapproximability factor less
than 1, we also assume that $n > 322$.

We now prove that the reduction presented is a gap-preserving reduction.
Specifically, we prove that:

$OPT(G) = n \Rightarrow OPT(S, t) = n - 1$

$OPT(G) < (1 - \alpha)n \Rightarrow OPT(S, t) < (1 - \beta)(n - 1)$

where $\alpha > 0$ and $\beta > 0$ are constants which are defined later. The
permutation $v_1 v_2 \cdots v_n$ corresponding to a tour of length $n$ contains
$n - 1$ strings from $S$ as substrings: $v_1 v_2, v_2 v_3, \ldots,
v_{n-1} v_n$. Therefore, the first implication is true.

Suppose now that $OPT(G) < (1 - \alpha)n$. Then, $OPT(S, t) < (1 - \alpha)n$,
since a permutation of $t$ defines a path in the graph, which is shorter than a
tour. We want to find a constant $\beta$ such that
$(1 - \alpha)n < (1 - \beta)(n - 1)$. The following inequality gives the
desired bound.

$1 - \beta > \frac{(1 - \alpha)n}{n - 1}$, that is,
$\beta < \frac{\alpha n - 1}{n - 1}$



Therefore, if the maximum ATSP problem does not admit a
$(1 - \alpha)$-approximation, then the RCSstr[2] problem (even in the case that
$t$ is a set) does not admit a $(1 - \beta)$-approximation (the reader may
refer to [24] for a more detailed argument of this claim). From Theorem 3, we
know that it is hard to approximate the Maximum Asymmetric Travelling Salesman
with 0, 1 edge weights within $320/321 + \epsilon$, for any $\epsilon > 0$.
Therefore, our problem is inapproximable within
$1 - \beta > n(320/321 + \epsilon)/(n - 1)$, for any $\epsilon > 0$.

□ 

We now show that even over a binary alphabet the RCSstr problem remains 
NP-Hard. 

Theorem 5. If $|\Sigma| = 2$, then the RCSstr problem is NP-Hard.

Proof. Let $\Sigma = \{0, 1\}$. We prove that if we can solve the RCSstr
problem on the alphabet $\Sigma$ in polynomial time, then we can solve the
shortest common superstring problem on the alphabet $\Sigma$ in polynomial
time.

Consider a shortest common superstring instance $S$, where the longest string
has length $\ell$. It is easy to see that $s_1 \cdot s_2 \cdots s_n$ is a
superstring of all the strings in $S$. Hence, the solution is no longer than
$n\ell$. We show that $O(n^2 \ell^2)$ calls to RCSstr are sufficient to find
the shortest common superstring of the given strings.

We call an RCSstr instance $(S, t)$ complete if all the strings of $S$ are
substrings of the optimal solution $\pi(t)$.

Note that there exists a string $x$ with $i$ 0's and $j$ 1's that is a common
superstring of all the strings in $S$ if and only if the RCSstr instance
$(S, 0^i 1^j)$ is complete. Therefore, we want to find the shortest string $t$
such that the RCSstr instance $(S, t)$ is complete. The shortest common
superstring is given by the permutation $\pi(t)$ returned by calling RCSstr on
the instance $(S, t)$. The number of multisets $0^i 1^j$ where
$i + j \le n\ell$ is $O(n^2 \ell^2)$. Therefore we can call RCSstr on all of
them and find the shortest common superstring of the given strings in
polynomial time (note that this time can be improved somewhat by employing a
binary search). The shortest common superstring problem is NP-Hard and the
theorem follows.

□ 
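The search over multisets $0^i 1^j$ in the proof can be sketched as follows. This is an illustration only: `rcsstr_oracle` is a hypothetical stand-in for the assumed polynomial-time RCSstr solver and is implemented here by exponential brute force, and `shortest_common_superstring_binary` is likewise a hypothetical name. The input strings are assumed to be distinct, as in Problem 1.

```python
from itertools import permutations

def rcsstr_oracle(S, t):
    """Stand-in exact RCSstr solver: returns (count, ordering) by brute force."""
    best = (-1, "")
    for p in set(permutations(t)):
        text = "".join(p)
        best = max(best, (sum(s in text for s in S), text))
    return best

def shortest_common_superstring_binary(S):
    """Query multisets 0^i 1^j by increasing size i + j.

    The first complete instance (all strings of S covered) yields a
    shortest common superstring of S.
    """
    bound = sum(len(s) for s in S)  # length of s_1 . s_2 ... s_n, at most n * ell
    for length in range(bound + 1):
        for i in range(length + 1):
            t = "0" * i + "1" * (length - i)
            count, text = rcsstr_oracle(S, t)
            if count == len(S):  # the instance (S, t) is complete
                return text
    return None

print(shortest_common_superstring_binary(["01", "10", "11"]))
```

Scanning sizes in increasing order is a design choice made for clarity; the proof only requires that all $O(n^2 \ell^2)$ multisets can be tried.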

2.2 Approximating RCSstr 

In this section we present approximation algorithms for two variants of the
RCSstr problem.

We first present a 3/4-approximation algorithm for the RCSstr[2] problem
where each character of $t$ is unique. Our algorithm follows immediately from
the NP-Hardness reduction presented in the previous section. Since each
character in $t$ is unique, we can construct a complete directed graph
$G = (V, E)$, with $V = t$, as in the proof of Theorem 4. We then apply the
3/4-approximation algorithm for the Maximum ATSP and we obtain a cycle
$t_{\pi(1)}, t_{\pi(2)}, \ldots, t_{\pi(n)}, t_{\pi(1)}$ of total weight $k$,
where $\pi : \{1, \ldots, n\} \to \{1, \ldots, n\}$ is a permutation.

If, for some $i < n$, $t_{\pi(i)} t_{\pi(i+1)} \notin S$, we output
$t_{\pi(i+1)} t_{\pi(i+2)} \cdots t_{\pi(n-1)} t_{\pi(n)} t_{\pi(1)} \cdots
t_{\pi(i)}$, which contains $k$ strings from $S$ as substrings (and yields an
approximation ratio of 3/4). Otherwise, we output
$t_{\pi(1)} t_{\pi(2)} \cdots t_{\pi(n-1)} t_{\pi(n)}$, which contains exactly
$n - 1$ strings from $S$ as substrings, which is optimal.
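The step that turns the tour into an ordering of $t$ can be sketched as below. The sketch assumes the tour has already been computed by some Maximum ATSP approximation algorithm (not implemented here), that symbols are single characters, and that $S$ is given as a set of two-character strings; `cycle_to_path` and `count_pairs` are hypothetical names.

```python
def cycle_to_path(cycle, S):
    """Cut a tour into an ordering of t.

    cycle: list of distinct symbols in tour order; the closing arc from
    cycle[-1] back to cycle[0] is implicit.
    """
    cycle = list(cycle)
    n = len(cycle)
    for i in range(n - 1):
        if cycle[i] + cycle[i + 1] not in S:
            # cut at a weight-0 arc: start right after position i
            return cycle[i + 1:] + cycle[:i + 1]
    # all n - 1 consecutive pairs are in S, which is optimal
    return cycle

def count_pairs(order, S):
    """Number of adjacent pairs of the ordering that are strings of S."""
    return sum(order[k] + order[k + 1] in S for k in range(len(order) - 1))

S = {"ab", "cd", "da"}
tour = ["a", "b", "c", "d"]            # arcs ab, bc, cd, da: weight 3
path = cycle_to_path(tour, S)
print("".join(path), count_pairs(path, S))
```

In this small instance the tour has weight 3 and the only weight-0 arc is $bc$; cutting there keeps all three weight-1 arcs in the resulting path.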

Here we present a simple $1/(\ell(\ell(\ell + 1)/2 - 1))$-approximation
algorithm for RCSstr[$\ell$].

The idea is to output a concatenation of a maximal collection of strings from
$S$. One can observe that each of the $\ell$ characters of a string in our
solution cannot be used by more than $\ell(\ell + 1)/2 - 1$ strings in the
optimal solution. Therefore, the algorithm yields a
$1/(\ell(\ell(\ell + 1)/2 - 1))$ approximation ratio. Formally, the algorithm
is presented below.

Algorithm 1: A $1/(\ell(\ell(\ell + 1)/2 - 1))$-approximation algorithm for
RCSstr[$\ell$]
Find a maximal subset $S' = \{s'_1, s'_2, \ldots, s'_q\} \subseteq S$ of
strings under the following constraint: there exists a permutation $\pi(t)$ of
the multiset such that $s'_1 \cdot s'_2 \cdots s'_q$ is a prefix of $\pi(t)$.

Output: $\pi(t)$
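One way to realize Algorithm 1 is a single greedy scan, using the fact that a concatenation is a prefix of some permutation of $t$ exactly when it uses no symbol more often than $t$ contains it. The sketch below makes that assumption explicit; the name `algorithm1_sketch` and the choice to scan $S$ in input order are illustrative and not prescribed by the paper.

```python
from collections import Counter

def algorithm1_sketch(S, t):
    """Greedy maximal collection for RCSstr.

    Scans S once and keeps a string whenever all of its symbols are still
    available in t. A string rejected during the scan stays infeasible
    later, because the available symbols only shrink, so the collection
    is maximal. Returns (permutation of t, chosen strings).
    """
    remaining = Counter(t)
    chosen = []
    for s in S:
        need = Counter(s)
        if all(remaining[c] >= k for c, k in need.items()):
            remaining -= need
            chosen.append(s)
    # append the unused symbols in an arbitrary (here: sorted) order
    leftovers = "".join(sorted(remaining.elements()))
    return "".join(chosen) + leftovers, chosen

print(algorithm1_sketch(["abb", "bbc", "cba", "aca"], "aabbcc"))
```

On the instance of Example 1 this scan order keeps only $abb$, whereas the optimum is 3, which illustrates that the guarantee is a worst-case ratio and that the result depends on the order in which strings are considered.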



Theorem 6. Algorithm 1 is a $1/(\ell(\ell(\ell + 1)/2 - 1))$-approximation
algorithm for RCSstr[$\ell$].

Proof. Note that a single character can be used simultaneously in at most
$\ell(\ell + 1)/2 - 1$ strings of the optimal solution. Since for every
$s_i \in S$, $|s_i| \le \ell$, we can conclude that a single string in our
solution can cause at most $\ell(\ell(\ell + 1)/2 - 1)$ other strings of the
optimal solution not to be chosen. Thus, the size of the optimal solution is at
most $q(\ell(\ell(\ell + 1)/2 - 1))$ and the approximation ratio follows. □

One tight example for the above analysis of Algorithm 1 is the following: 
t = {a,b,c,q,q,q, z, z, z,w,w,w, x,x,x}, and S = {abc, qa, az, wqa, qaz, azx, 
qb, bz, wqb, qbz, bzx, qc, cz, wqc, qcz, czx}. If we first select into the maximal 
collection the string abc, then we cannot add any other string to our solution. 
The optimal solution has size 15 and consists of all the other strings. 
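The tight example can be checked mechanically. The sketch below is an illustration; the ordering `optimal` is one candidate constructed here by concatenating $wqazx$, $wqbzx$ and $wqczx$, and is not given in the text.

```python
from collections import Counter

t = "abc" + "qqq" + "zzz" + "www" + "xxx"
S = ["abc", "qa", "az", "wqa", "qaz", "azx", "qb", "bz", "wqb", "qbz", "bzx",
     "qc", "cz", "wqc", "qcz", "czx"]

# One ordering of t that covers every string except abc.
optimal = "wqazx" + "wqbzx" + "wqczx"
is_permutation = Counter(optimal) == Counter(t)
covered = [s for s in S if s in optimal]

# After committing to abc, check which other strings still fit in what is left.
after_abc = Counter(t) - Counter("abc")
still_fits = [s for s in S[1:] if not (Counter(s) - after_abc)]

print(is_permutation, len(covered), still_fits)
```

Every string other than $abc$ contains one of $a$, $b$, $c$, each of which occurs only once in $t$, so nothing can be concatenated after $abc$, while the candidate ordering covers the remaining 15 strings.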

Observation 1. Given an RCSstr[$\ell$] instance, if for every $s_i \in S$,
$s_i$ is not a substring of any other $s_j \in S$, then Algorithm 1 is an
$\ell^2$-approximation algorithm.

Proof. Note that a single character can be used simultaneously in at most
$\ell$ strings of the optimal solution; thus, a single string in our solution
can stop at most $\ell^2$ other strings of the optimal solution from being
placed. □

Notice that if all input strings are of length $\ell$, the condition of the
above observation holds.

3 RCSseq 

We now turn to the RCSseq problem. We first present hardness results and
lower bounds for several variants of the RCSseq problem and then we present two
approximation algorithms.

3.1 Hardness of the RCSseq problem 

In the following theorem we show that the hardness result for the general
RCSstr problem also holds for RCSseq.

Theorem 7. RCSseq is NP-complete and hard to approximate within a factor
of $n^{1-\epsilon}$, for any $\epsilon > 0$, unless P = NP.

Proof. Omitted (similar to the proof of Theorem 2). 

Moreover, we state that even over a binary alphabet the RCSseq problem 
remains NP-Hard. 

Theorem 8. If $|\Sigma| = 2$, then the RCSseq problem is NP-Hard.

Proof. Omitted (similar to the proof of Theorem 5).

We now prove that RCSseq is APX-Hard even if all the input strings are
of length two and $t$ is a set. To do so, we present an
approximation-preserving reduction from the classical maximum acyclic subgraph
problem.



Definition 3. (Maximum Acyclic Subgraph) Given a directed graph G = (V,E) 
the maximum acyclic subgraph problem is to find a subset A of the arcs such that 
G' = (V, A) is acyclic and A has maximum cardinality. 

Theorem 9. [14] The Maximum Acyclic Subgraph problem is APX-Complete. 

We can now present our hardness result. 

Theorem 10. RCSseq is APX-Hard even if all the strings in S have length two 
and t is a set. 

Proof. We present an approximation-preserving reduction from the maximum
acyclic subgraph problem. Given a directed graph $G = (V, E)$ we construct an
instance $(S, t)$ of the RCSseq problem as follows. Set $t = V$ and for every
arc $(a, b) \in E$ add a string $ab$ to $S$.

Let $\pi$ be a permutation of the set $t$ and let $A \subseteq S$ be all the
strings that are subsequences of $\pi(t)$. The corresponding edge set $A$ is an
acyclic subgraph of $G$. On the other hand, let $A \subseteq E$ be an acyclic
subgraph. Consider a topological ordering of $(V, A)$. All strings
corresponding to edges of $A$ are subsequences of the permutation $\pi(t)$ that
corresponds to the topological ordering.

Note that the optimal solution of the RCSseq instance $(S, t)$ has size $x$ if
and only if the optimal solution of the maximum acyclic subgraph problem on the
graph $G$ has size $x$. Thus, the RCSseq problem is APX-Hard. □
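Both directions of the correspondence can be illustrated in a few lines. The sketch below assumes Python 3.9 or later for the standard-library `graphlib` module; the function names `arcs_satisfied` and `order_from_acyclic` are hypothetical.

```python
from graphlib import TopologicalSorter

def arcs_satisfied(order, arcs):
    """Arcs (a, b) whose string ab is a subsequence of the ordering.

    Since every vertex occurs once, ab is a subsequence exactly when a is
    placed before b; the returned arc set is therefore acyclic.
    """
    pos = {v: k for k, v in enumerate(order)}
    return [(a, b) for a, b in arcs if pos[a] < pos[b]]

def order_from_acyclic(vertices, acyclic_arcs):
    """Topological order of (V, A); TopologicalSorter maps node -> predecessors."""
    preds = {v: set() for v in vertices}
    for a, b in acyclic_arcs:
        preds[b].add(a)
    return list(TopologicalSorter(preds).static_order())

arcs = [("a", "b"), ("b", "c"), ("c", "a")]      # a directed triangle
A = [("a", "b"), ("b", "c")]                     # an acyclic subgraph
order = order_from_acyclic("abc", A)
print(order, arcs_satisfied(order, arcs))
```

For the directed triangle, any ordering satisfies at most two of the three arcs, and a topological order of the chosen acyclic arc set satisfies exactly those arcs.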

In [8] the following result is proven. 

Theorem 11. The maximum acyclic subgraph problem is Unique-Games hard 
to approximate within a factor better than the trivial 1/2 achieved by a random 
ordering. 

The maximum acyclic subgraph problem is a special case of a permutation
constraint satisfaction problem (permCSP). A permCSP of arity $k$ is specified
by a subset $S$ of permutations on $\{1, 2, \ldots, k\}$. An instance of such a
permCSP consists of a set of variables $V$ and a collection of constraints,
each of which is an ordered $k$-tuple of $V$. The objective is to find a global
ordering $\sigma$ of the variables that maximizes the number of constraint
tuples whose ordering (under $\sigma$) follows a permutation in $S$. In [4]
Charikar, Guruswami and Manokaran prove the following result.

Theorem 12. For every permCSP of arity 3, beating the random ordering is 
Unique-Games hard. 

Our problem corresponds to a permCSP where $S$ contains only the identity
permutation. Therefore we can conclude the following.



Theorem 13. RCSseq[2] is Unique-Games hard to approximate within a factor 
better than 1/2. 

Theorem 14. RCSseq[3] is Unique-Games hard to approximate within a factor 
better than 1/6. 

Currently there is an unpublished result by Charikar, Hastad and Guruswami
stating that every $k$-ary permCSP is approximation resistant. This implies
that RCSseq[$\ell$] cannot have an approximation algorithm with a ratio better
than $1/\ell!$.

3.2 Approximating RCSseq 

In this subsection we present a $(1 + \Omega(1/\sqrt{\Delta}))/2$-approximation
algorithm for the RCSseq[2] problem, where $\Delta$ is the number of
occurrences of the most frequent character in $S$. We also present a simple
randomized approximation algorithm which achieves an approximation ratio of
$1/\ell!$.

Theorem 15. [2] The maximum acyclic subgraph problem is approximable within
$(1 + \Omega(1/\sqrt{\Delta}))/2$, where $\Delta$ is the maximum degree of a
node in the graph.

Given a multiset $t$, let $P'$ be the set of characters that have a single
occurrence in $t$ and let $P$ be $\Sigma \setminus P'$, where $\Sigma$ is the
alphabet of $t$. We define $Q$ to be the following multiset. For every
$a \in P$, if $a$ has $r$ occurrences in $t$, then $a$ has $r - 2$ occurrences
in $Q$.

Algorithm 2: A $(1 + \Omega(1/\sqrt{\Delta}))/2$-approximation algorithm for
RCSseq[2]

1. Given a multiset $t$, construct a graph $G = (V, E)$ such that:
$v_i \in V$ iff $v_i \in P'$, and $(a, b) \in E$ iff $a, b \in P'$ and
$ab \in S$.

2. Apply the $(1 + \Omega(1/\sqrt{\Delta}))/2$-approximation algorithm for the
maximum acyclic subgraph to the graph $G$. Denote the output subgraph by
$G' = (V, E')$.

3. Let $F'$ be a topological order of the vertices of $G'$.
Let $F$ and $F''$ be arbitrary orderings of $P$ and $Q$ respectively.

4. Output $F \cdot F' \cdot F \cdot F''$.
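The overall shape of Algorithm 2 can be sketched as follows. Note the substitution: step 2 is replaced here by the trivial 1/2-approximation for maximum acyclic subgraph (take an arbitrary order or its reverse, whichever keeps more arcs), not the $(1 + \Omega(1/\sqrt{\Delta}))/2$ algorithm of Theorem 15, so the sketch does not attain the stated ratio. The names `algorithm2_sketch` and `count_subsequences` are hypothetical, and all strings in $S$ are assumed to have length exactly two.

```python
from collections import Counter

def count_subsequences(S, text):
    """Number of strings of S that are subsequences of text."""
    def is_subseq(s):
        it = iter(text)
        return all(ch in it for ch in s)
    return sum(is_subseq(s) for s in S)

def algorithm2_sketch(S, t):
    counts = Counter(t)
    singles = sorted(c for c in counts if counts[c] == 1)    # P'
    repeated = sorted(c for c in counts if counts[c] >= 2)   # P
    extra = "".join(c * (counts[c] - 2) for c in repeated)   # an ordering F'' of Q
    # arcs of G: strings ab with both symbols occurring once in t
    arcs = [s for s in S if s[0] in singles and s[1] in singles]

    def forward(order):
        pos = {c: k for k, c in enumerate(order)}
        return sum(pos[a] < pos[b] for a, b in arcs)

    # Stand-in for step 2: an order or its reverse keeps at least half the arcs.
    order = singles if forward(singles) >= forward(singles[::-1]) else singles[::-1]
    F = "".join(repeated)
    return F + "".join(order) + F + extra                    # F . F' . F . F''

S = ["ab", "bc", "ca", "cd", "db", "ec", "ea", "ee", "be"]
out = algorithm2_sketch(S, "abcdee")
print(out, count_subsequences(S, out))
```

Placing $F$ on both sides of $F'$ is what makes every string containing a repeated character a subsequence regardless of the order chosen for the single-occurrence characters.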



Figure 1 is an example of Algorithm 2. In the first stage we construct a graph
according to the first two steps; note that $P = \{e\}$,
$P' = \{a, b, c, d\}$ and $Q = \emptyset$. Then we present an acyclic directed
subgraph and we output $F \cdot F' \cdot F \cdot F''$, where $F = e$ and
$F' = cadb$.

Theorem 16. Algorithm 2 is a $(1 + \Omega(1/\sqrt{\Delta}))/2$-approximation
algorithm for the RCSseq[2] problem, where $\Delta$ is the maximum number of
occurrences of a character in the set $S$.



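[Figure 1: the instance $t = abcdee$ with $s_1 = ab$, $s_2 = bc$, $s_3 = ca$,
$s_4 = cd$, $s_5 = db$, $s_6 = ec$, $s_7 = ea$, $s_8 = ee$, $s_9 = be$, shown
with its graph and an acyclic subgraph.]

Fig. 1. Algorithm 2 example.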



Proof. Consider a string $ab \in S$. If $a \in P$ or $b \in P$ (or both), then
$ab$ is always a subsequence of $F \cdot F' \cdot F$. Otherwise, if both $a$
and $b$ appear only once in $t$, then $ab$ is a subsequence of
$F \cdot F' \cdot F$ if and only if the edge $(a, b)$ is selected in the arc
set of the maximum acyclic subgraph. Since the maximum acyclic subgraph problem
has an approximation ratio of $(1 + \Omega(1/\sqrt{\Delta}))/2$, the same
approximation ratio holds for the RCSseq[2] problem. □

We now deal with RCSseq[$\ell$] instances. We show that selecting an arbitrary
permutation $\pi(t)$ achieves an expected approximation ratio of $1/\ell!$.

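We denote by $P(s_i, \pi(t))$ the probability that a string $s_i \in S$ is a
subsequence of a permutation $\pi(t)$ chosen uniformly at random.

Note that $P(s_i, \pi(t)) \ge 1/\ell!$, since the at most $\ell$ symbols of $t$
matched to $s_i$ appear in the required relative order with probability at
least $1/\ell!$. Therefore, the expected number of strings from $S$ that are
subsequences of an arbitrary permutation $\pi(t)$ is at least $n/\ell!$. Thus,
selecting an arbitrary permutation $\pi(t)$ achieves an expected approximation
ratio of $1/\ell!$.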
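The expectation bound can be checked exactly on a small instance. The sketch below enumerates all orderings of the instance from Example 2, treating equal symbols as distinguishable so that every ordering is equally likely; `expected_count` is a hypothetical name and the enumeration is exponential, so this is an illustration only.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def is_subsequence(s, text):
    """Return True if s can be obtained from text by deleting symbols."""
    it = iter(text)
    return all(ch in it for ch in s)

def expected_count(S, t):
    """Exact expected number of covered strings under a uniform random ordering of t."""
    # Copies of equal symbols are counted separately, so the average is uniform.
    orders = list(permutations(t))
    total = sum(is_subsequence(s, p) for p in orders for s in S)
    return Fraction(total, len(orders))

S, t = ["ab", "bc", "cb", "ca"], "aabc"
ell = max(len(s) for s in S)
print(expected_count(S, t), Fraction(len(S), factorial(ell)))
```

For this instance the expectation is $7/3$, which is above the bound $n/\ell! = 2$; the surplus comes from the repeated symbol $a$, which makes $ab$ and $ca$ more likely than $1/2$ to appear.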